Maximum reward reinforcement learning: A non-cumulative reward criterion
Authors
Abstract
Existing reinforcement learning paradigms in the literature are guided by two performance criteria: the expected cumulative reward and the average reward. Both criteria assume that rewards are inherently cumulative, or additive. However, such additivity of rewards is not a necessity in some contexts. Two such scenarios are presented in this paper. The first concerns learning an optimal policy that lies farther away when a sub-optimal policy lies nearer; cumulative reward paradigms converge more slowly because the lower intermediate rewards accumulate, and it takes time for the effect of the sub-optimal policy to fade. The second concerns approximating the supremum of the payoffs in an optimal stopping problem; the payoffs are non-cumulative in nature, so the cumulative reward paradigm does not apply. Hence, a non-cumulative reward reinforcement-learning paradigm is needed in these application contexts. A maximum reward criterion is proposed in this paper, and the resulting reinforcement-learning model with this learning criterion is termed maximum reward reinforcement learning. It addresses the learning of non-cumulative rewards, where the agent exhibits a maximum-reward-oriented behavior towards the largest rewards in the state space; intermediate lower rewards that lead to sub-optimal policies are ignored. Maximum reward reinforcement learning is subsequently modeled with the FITSK-RL model. Finally, the model is applied to an optimal stopping problem whose rewards are non-cumulative in nature, and its performance is encouraging when benchmarked against another model. © 2005 Elsevier Ltd. All rights reserved.
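For intuition, the sketch below shows one way a maximum-reward backup could look in a tabular setting: each update keeps the larger of the immediate reward and the discounted best future value, V(s) = max_a max(r(s,a), γ V(s')), instead of the usual additive Bellman backup. This is an illustrative assumption for a toy deterministic MDP, not the FITSK-RL formulation used in the paper; all names and hyperparameters here are hypothetical.

```python
import numpy as np

def max_reward_value_iteration(rewards, transitions, gamma=0.95, iters=500):
    """Tabular value iteration under a maximum-reward (non-cumulative) criterion.

    Instead of the additive backup V(s) = max_a [r(s,a) + gamma * V(s')],
    each backup keeps the larger of the immediate reward and the discounted
    best future reward:  V(s) = max_a max(r(s,a), gamma * V(s')).
    `rewards[s, a]` and `transitions[s, a]` (next state) define a small
    deterministic MDP; both are illustrative assumptions.
    """
    n_states, n_actions = rewards.shape
    V = np.zeros(n_states)
    for _ in range(iters):
        V_new = np.empty(n_states)
        for s in range(n_states):
            V_new[s] = max(
                max(rewards[s, a], gamma * V[transitions[s, a]])
                for a in range(n_actions)
            )
        if np.max(np.abs(V_new - V)) < 1e-9:  # converged
            return V_new
        V = V_new
    return V

# Toy chain 0-1-2-3-4 (action 0 moves left, action 1 moves right).
# From the start state 3, a small reward (0.3) lies one step to the right,
# while the large reward (1.0) lies three steps to the left.  Under the
# maximum-reward criterion the small intermediate reward does not
# accumulate, so it cannot mask the larger, more distant reward.
rewards = np.array([
    [1.0, 0.0],   # state 0: action 0 yields the large reward
    [0.0, 0.0],
    [0.0, 0.0],
    [0.0, 0.3],   # state 3 (start): action 1 grabs the nearby small reward
    [0.0, 0.0],
])
transitions = np.array([
    [0, 1], [0, 2], [1, 3], [2, 4], [3, 4],
])  # transitions[s, a] = next state (deterministic)
print(max_reward_value_iteration(rewards, transitions))
```

With these numbers the value at the start state is roughly 0.95^3 * 1.0 ≈ 0.86, so the policy heads for the distant large reward rather than settling for the nearby 0.3, which mirrors the first scenario described in the abstract.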
Similar resources
An Analysis of Feature Selection and Reward Function for Model-Based Reinforcement Learning
In this paper, we propose a series of correlation-based feature selection methods for dealing with high dimensionality in feature-rich environments for model-based Reinforcement Learning (RL). Real-world RL tasks usually involve high-dimensional feature spaces where standard RL methods often perform badly. Our proposed approach adopts correlation among state features as a selection criterion. The...
On the Performance of Maximum Likelihood Inverse Reinforcement Learning
Inverse reinforcement learning (IRL) addresses the problem of recovering a task description given a demonstration of the optimal policy used to solve such a task. The optimal policy is usually provided by an expert or teacher, making IRL especially suitable for the problem of apprenticeship learning. The task description is encoded in the form of a reward function of a Markov decision process (M...
Toward a unified theory of decision criterion learning in perceptual categorization.
Optimal decision criterion placement maximizes expected reward and requires sensitivity to the category base rates (prior probabilities) and payoffs (costs and benefits of incorrect and correct responding). When base rates are unequal, human decision criterion is nearly optimal, but when payoffs are unequal, suboptimal decision criterion placement is observed, even when the optimal decision cri...
Loss is its own Reward: Self-Supervision for Reinforcement Learning
Reinforcement learning optimizes policies for expected cumulative reward. Need the supervision be so narrow? Reward is delayed and sparse for many tasks, making it a difficult and impoverished signal for end-to-end optimization. To augment reward, we consider a range of selfsupervised tasks that incorporate states, actions, and successors to provide auxiliary losses. These losses offer ubiquito...
Journal: Expert Syst. Appl.
Volume: 31, Issue: -
Pages: -
Publication year: 2006